EPrints Technical Mailing List Archive

Message: #00478

[EP-tech] Re: Garbage indexing some pdf

To: eprints-tech@ecs.soton.ac.uk
Subject: [EP-tech] Re: Garbage indexing some pdf
From: Paolo Tealdi <paolo.tealdi@polito.it>
Date: Thu, 03 May 2012 10:01:48 +0200

On 04/27/2012 02:02 PM, rchilliard@mun.ca wrote:
Hi,

Regarding your p.s., we noticed that many records in our repository have badly indexed words (containing non-breaking space characters or similar). Below is a (very) dirty quick patch for Tokenizer.pm that adds the most frequent word-breaking characters found in our fulltext to the separator list.


Index: Tokenizer.pm
===================================================================
--- Tokenizer.pm    (revision 323)
+++ Tokenizer.pm    (working copy)
@@ -259,8 +259,63 @@
     '.' => 1,     '/' => 1,     ':' => 1,     ';' => 1,
     '{' => 1,     '<' => 1,     '|' => 1,     '=' => 1,
     '}' => 1,     '>' => 1,     '~' => 1,     '?' => 1,

- chr(0xb4) => 1, chr(0x27)=>1, '{' => 1, '}' => 1 # AcuteAccent (closing quote)

+    chr(0xb4) => 1, chr(0x27)=>1,   '{' => 1,       '}' => 1,
+chr(0x81) => 1,
+chr(0x83) => 1,
+chr(0x00a0) => 1,
+chr(0x0090) => 1,
+chr(0x0099)  => 1,
+chr(0x009c)  => 1,
+chr(0x009d) => 1,
+chr(0x02B9) => 1, # ca b9    MODIFIER LETTER PRIME
+chr(0x02BA) => 1, # ca ba    MODIFIER LETTER DOUBLE PRIME
+chr(0x02BB) => 1, # ca bb    MODIFIER LETTER TURNED COMMA
+chr(0x02BC) => 1, # ca bc       MODIFIER LETTER APOSTROPHE
+chr(0x02BD) => 1, # ca bd    MODIFIER LETTER REVERSED COMMA
+chr(0x02BE) => 1, # ca be    MODIFIER LETTER RIGHT HALF RING
+chr(0x02BF) => 1, # ca bf    MODIFIER LETTER LEFT HALF RING
+chr(0x2000) => 1, # e2 80 80    EN QUAD
+chr(0x2001) => 1, # e2 80 81    EM QUAD
+chr(0x2002) => 1, # e2 80 82    EN SPACE
+chr(0x2003) => 1, # e2 80 83    EM SPACE
+chr(0x2004) => 1, # e2 80 84    THREE-PER-EM SPACE
+chr(0x2005) => 1, # e2 80 85    FOUR-PER-EM SPACE
+chr(0x2006) => 1, # e2 80 86    SIX-PER-EM SPACE
+chr(0x2007) => 1, # e2 80 87    FIGURE SPACE
+chr(0x2008) => 1, # e2 80 88    PUNCTUATION SPACE
+chr(0x2009) => 1, # e2 80 89    THIN SPACE
+chr(0x200A) => 1, # e2 80 8a    HAIR SPACE
+chr(0x200B) => 1, # e2 80 8b    ZERO WIDTH SPACE
+chr(0x2024)  => 1, # e2 80 a4    ONE DOT LEADER
+chr(0x2025)  => 1, # e2 80 a5   TWO DOT LEADER
+chr(0x2026)  => 1, # e2 80 a6   HORIZONTAL ELLIPSIS
+chr(0x2027)  => 1, # e2 80 a7   HYPHENATION POINT
+chr(0x2028)  => 1, # e2 80 a8   LINE SEPARATOR
+chr(0x2029)  => 1, # e2 80 a9   PARAGRAPH SEPARATOR
+chr(0x2018) => 1,  # e2 80 98    LEFT SINGLE QUOTATION MARK
+chr(0x2019) => 1, # e2 80 99    RIGHT SINGLE QUOTATION MARK
+chr(0x201c) => 1, # e2 80 9c    LEFT DOUBLE QUOTATION MARK
+chr(0x201d) => 1,  # e2 80 9d    RIGHT DOUBLE QUOTATION MARK
+chr(0x2010) => 1,  # e2 80 90    HYPHEN
+chr(0x2011) => 1,  # e2 80 91    NON-BREAKING HYPHEN
+chr(0x2012) => 1,  # e2 80 92    FIGURE DASH
+chr(0x2013) => 1,  # e2 80 93    EN DASH
+chr(0x2014) => 1,  # e2 80 94    EM DASH
+chr(0x2015) => 1,  # e2 80 95    HORIZONTAL BAR
+chr(0xFB00) => 1,  #ef ac 80    LATIN SMALL LIGATURE FF
+chr(0xFB01) => 1,  #ef ac 81    LATIN SMALL LIGATURE FI
+chr(0xFB02) => 1,  #ef ac 82    LATIN SMALL LIGATURE FL
+chr(0xFB03) => 1,  #ef ac 83    LATIN SMALL LIGATURE FFI
+chr(0xFB04) => 1,  #ef ac 84    LATIN SMALL LIGATURE FFL
+chr(0xFB05) => 1,  #ef ac 85    LATIN SMALL LIGATURE LONG S T
+chr(0xFB06) => 1,  #ef ac 86    LATIN SMALL LIGATURE ST
+chr(0xFFF9 ) => 1,  #ef bf b9 INTERLINEAR ANNOTATION ANCHOR
+chr(0xFFFA ) => 1,  #ef bf ba INTERLINEAR ANNOTATION SEPARATOR
+chr(0xFFFB ) => 1,  #ef bf bb INTERLINEAR ANNOTATION TERMINATOR
+chr(0xFFFC ) => 1,  #ef bf bc OBJECT REPLACEMENT CHARACTER
+chr(0xFFFD ) => 1  #ef bf bd REPLACEMENT CHARACTER
 };
+

$EPrints::Index::FREETEXT_SEPERATOR_REGEXP = quotemeta(join "", keys %$EPrints::Index::FREETEXT_SEPERATOR_CHARS);
$EPrints::Index::FREETEXT_SEPERATOR_REGEXP = qr/[$EPrints::Index::FREETEXT_SEPERATOR_REGEXP\x00-\x20]/;
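
To see why adding keys to the separator table is enough to change tokenization, here is a minimal standalone sketch of the same construction (escape the joined hash keys, wrap them in a character class, split on it). It does not load EPrints; `%separator_chars` and `split_words` are illustrative names only, and the table holds just a handful of the characters from the patch.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the separator table: a few entries only.
my %separator_chars = (
    ','         => 1,
    '.'         => 1,
    chr(0x00a0) => 1,    # NO-BREAK SPACE
    chr(0x2019) => 1,    # RIGHT SINGLE QUOTATION MARK
    chr(0x2014) => 1,    # EM DASH
);

# Same construction as the two lines above: escape the joined keys, then
# wrap them in a character class together with the \x00-\x20 range.
my $class        = quotemeta( join "", keys %separator_chars );
my $separator_re = qr/[$class\x00-\x20]/;

# Split on the separator class and drop empty fields.
sub split_words {
    my ($text) = @_;
    return grep { length } split /$separator_re/, $text;
}

my $text  = "full\x{a0}text indexing\x{2014}it\x{2019}s broken";
my @words = split_words($text);
print join( "|", @words ), "\n";
```

With the non-breaking space in the table, the sample prints `full|text|indexing|it|s|broken`. Without that entry, "full" and "text" would stay fused into a single term.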


Best regards,
Paolo

Hi Paolo,

    I took a quick peek at the sample you provided, and it looks like the character mapping is missing for the content text. If you export the PDF to text via Acrobat or equivalent, a hex editor shows that every character in the output file is mapped to ASCII 0x2e. With a vanilla run of pdftotext (e.g. pdftotext test.pdf test.txt) the characters are mapped to ASCII 0x20, and with UTF-8 output from pdftotext (the command run by the indexer is roughly pdftotext -enc UTF-8 test.pdf test_utf.txt) you get the byte sequence "ef 80 bd" for each character.
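
The byte sequence "ef 80 bd" is the UTF-8 encoding of U+F03D, which lies in the Unicode Private Use Area (U+E000..U+F8FF). Assuming the extracted text is available as a UTF-8 byte string, a rough way to flag such documents before indexing might look like the sketch below. `private_use_ratio` is a hypothetical helper, not part of EPrints or pdftotext.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: fraction of non-whitespace characters that fall in
# the Basic Multilingual Plane Private Use Area (U+E000..U+F8FF).
sub private_use_ratio {
    my ($bytes) = @_;
    my $text    = decode( 'UTF-8', $bytes );
    my $total   = () = $text =~ /\S/g;
    my $private = () = $text =~ /[\x{E000}-\x{F8FF}]/g;
    return $total ? $private / $total : 0;
}

# Five copies of the byte sequence reported for the sample PDF.
my $sample = "\xef\x80\xbd" x 5;

printf "U+%04X\n", ord( decode( 'UTF-8', "\xef\x80\xbd" ) );
printf "%.2f\n", private_use_ratio($sample);
```

The first line printed is `U+F03D` and the second is `1.00`. A ratio close to 1 would suggest the file has no usable character mapping, so its fulltext is probably not worth indexing. Any cut-off value is a local judgment call.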

    It may be possible to reconstruct the mapping information retroactively, but I'm not aware of a mechanism to perform that operation. Also, it appears that this might have been done on purpose when the PDF was generated: most tellingly, the licensing / attribution information at the end of the file is mapped properly.

p.s. Thank you for the note / query on testing the indexed word lengths. It has alerted us to a potential issue in our repository (and possibly others') whereby multiple words are indexed as a single cluster because they are not tokenized on non-breaking space ('&nbsp;') characters.
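
One way to check an existing index for such clusters, assuming the indexed terms can be exported as a plain list of strings, is sketched below. `suspicious_terms` is a hypothetical helper, and the length threshold of 20 is an arbitrary choice for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return terms that look like several words fused
# together, i.e. that still contain a non-breaking or other Unicode space
# (U+00A0, U+2000..U+200B) or exceed a caller-chosen length.
sub suspicious_terms {
    my ( $max_length, @terms ) = @_;
    return grep { /[\x{a0}\x{2000}-\x{200b}]/ || length($_) > $max_length } @terms;
}

my @terms   = ( "repository", "open\x{a0}access\x{a0}policy", "pdf" );
my @flagged = suspicious_terms( 20, @terms );
print scalar(@flagged), " suspicious term(s)\n";
```

Here only the middle term is flagged, because it still contains non-breaking spaces.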


--
Ing. Paolo Tealdi         Area IT - Politecnico Torino
Telefono/Phone : +39-011-0906714 , FAX : +39-011-0906799
Indirizzo/Address : C.so Duca degli Abruzzi,  24 - 10129 Torino - ITALY
Skype : tealdi.paolo

References:
- [EP-tech] Garbage indexing some pdf
  - From: Paolo Tealdi <paolo.tealdi@polito.it>
- [EP-tech] Re: Garbage indexing some pdf
  - From: "Manojlovich, Slavko" <slavko@mun.ca>
- [EP-tech] Re: Garbage indexing some pdf
  - From: Paolo Tealdi <paolo.tealdi@polito.it>
- [EP-tech] Re: Garbage indexing some pdf
  - From: Paolo Tealdi <paolo.tealdi@polito.it>
- [EP-tech] Re: Garbage indexing some pdf
  - From: <rchilliard@mun.ca>
